Distributionally robust learning-to-rank under the Wasserstein metric

Despite their satisfactory performance, most existing listwise Learning-To-Rank (LTR) models do not consider the crucial issue of robustness. A data set can be contaminated in various ways, including human error in labeling or annotation, distributional data shift, and malicious adversaries who wish to degrade the algorithm’s performance. It has been shown that Distributionally Robust Optimization (DRO) is resilient against various types of noise and perturbations. To fill this gap, we introduce a new listwise LTR model called Distributionally Robust Multi-output Regression Ranking (DRMRR). Different from existing methods, the scoring function of DRMRR was designed as a multivariate mapping from a feature vector to a vector of deviation scores, which captures local context information and cross-document interactions. In this way, we are able to incorporate the LTR metrics into our model. DRMRR uses a Wasserstein DRO framework to minimize a multi-output loss function under the most adverse distributions in the neighborhood of the empirical data distribution defined by a Wasserstein ball. We present a compact and computationally solvable reformulation of the min-max formulation of DRMRR. Our experiments were conducted on two real-world applications: medical document retrieval and drug response prediction, showing that DRMRR notably outperforms state-of-the-art LTR models. We also conducted an extensive analysis to examine the resilience of DRMRR against various types of noise: Gaussian noise, adversarial perturbations, and label poisoning. Accordingly, DRMRR is not only able to achieve significantly better performance than other baselines, but it can maintain a relatively stable performance as more noise is added to the data.


Introduction
There exist many real-world applications such as recommendation systems, document retrieval, machine translation, and computational biology where the correct ordering of instances is of equal or greater importance than minimizing regression or classification errors [1]. Learning-to-rank (LTR) refers to a group of algorithms that apply machine learning techniques to tackle these ranking problems. Generally speaking, LTR methods learn a scoring function that maps an instance-query feature vector to a relevance score (i.e., multi-level rating/label) that is then used to rank instances for a given query. Ideally, the resulting ranked list should maximize a ranking metric [2][3][4]. We considered two medical applications of LTR, namely medical document retrieval and drug response prediction. Healthcare applications commonly face various challenges including: (i) susceptibilities in data collection due to instrument and environmental noise or data entry errors; (ii) ambiguous or improper data annotation; (iii) lack of large-scale data for training and testing of algorithms; (iv) imbalanced data sets; (v) missing data; (vi) divergence of training and testing data distributions (e.g., data is recorded by different hospitals using different procedures); and, more importantly, (vii) the threat of adversarial attacks [5][6][7]. Consequently, robustness is critical for the wider adoption and deployment of algorithms into healthcare systems [7]. In this work, and without loss of generality, we take document retrieval as an example to explain the concepts and formulations. The main goal of document retrieval is to rank a set of documents by their relevance to a query. A slightly different example in computational biology is drug response prediction. For instance, prescribing the right therapeutic option for each cancer patient is an intricate task since the efficacy of cancer medications varies among patients. 
Nevertheless, the biological differences among cancers can be used to design genomic predictors of drug responses from large panels of cancer cell lines [8]. In drug response prediction, large-scale screenings of cancer cell lines against libraries of pharmacological compounds are used to predict precise and individualized medications [8].
Existing LTR approaches fall into three categories, namely pointwise, pairwise, and listwise [9]. The pointwise approach formulates ranking as a classification or regression problem; most early LTR algorithms such as linear regression ranking [9] or RankNet [10] take a very similar approach. In the pairwise approach, a classification method is employed to classify the preference order within document pairs. Representative pairwise ranking algorithms include RankBoost [11], RankNet [10], and ordinal regression [9]. Both approaches are misaligned with ranking utilities such as Normalized Discounted Cumulative Gain (NDCG) and do not straightforwardly model the ranking problem. Listwise models can overcome this drawback by taking the entire list of retrieved documents for a query as an instance and training a ranking function through the minimization of a listwise loss function. Experimental results show that listwise approaches generally outperform pointwise and pairwise algorithms [12]. The literature offers a variety of approaches, from deriving a smooth approximation to ranking utilities (e.g., ApproxNDCG [13] and SoftRank [14]) to constructing differentiable surrogate loss functions (e.g., ListMLE [15], LambdaMART [16], and ListNet [12]). Specifically, ListNet and ListMLE try to learn the best document permutation based on permutation probabilities via the Plackett-Luce model, while SoftRank and ApproxNDCG use ranking metrics or positions to tune their loss functions. On the other hand, LambdaMART employs heuristics to compute the gradients of an unknown loss function directly.
Most existing studies on LTR achieve impressive performance but often neglect the importance of robustness [9]. Systematic noise can become part of a data set in many ways and deceive LTR models to rank an item at an incorrect position with high confidence. While Empirical Risk Minimization (ERM) has been effective to optimize loss, ERM often does not yield models that are robust to adversarially crafted samples [17]. Distributionally Robust Optimization (DRO) is a modeling paradigm for data-driven decision-making under uncertainty. It has been successful in handling problems with corrupted training data through hedging against the most adverse distribution within a Wasserstein ball [18]. Recently, DRO has been an active area of research owing to its robustness to adversarial examples, rigorous out-of-sample and asymptotic consistency guarantees, and excellent empirical performance [19].
In the present work, we seek to infuse robustness into LTR problems through the DRO framework. Equipped with this perspective, we make the following contributions. Unlike other LTR frameworks, our algorithm approaches listwise ranking in a novel way and employs ranking metrics (i.e., NDCG) in its output. In particular, we use the notion of position deviation to define a vector of relevance scores instead of a scalar. We then adopt the DRO framework to minimize a worst-case expected multi-output loss function over a probabilistic ambiguity set that is defined by the Wasserstein metric. To the best of our knowledge, ours is the first study that utilizes a multi-output Wasserstein DRO framework to robustify LTR problems. We present an equivalent convex reformulation of the DRO problem, which is shown to be tighter than earlier work [18]. In experiments, our approach yields state-of-the-art results in two challenging applications of LTR, namely medical document retrieval and drug response prediction. More importantly, we evaluate our model to verify its robustness against various types of attacks including adversarial attacks and label attacks, showing that our model maintains a consistently good performance under various attack scenarios.

Notational conventions
We use boldfaced lowercase letters to denote vectors, ordinary lowercase letters to denote scalars, boldfaced uppercase letters to denote matrices, and calligraphic capital letters to denote sets. All vectors are column vectors. For space saving reasons, we write $\mathbf{x}$ to denote the column vector $(x_1, \ldots, x_{\dim(\mathbf{x})})$, where $\dim(\mathbf{x})$ is the dimension of $\mathbf{x}$. We use prime to denote the transpose, $[\![N]\!]$ for the set $\{1, \ldots, N\}$ for any integer $N$, $\|\cdot\|_p$ for the $\ell_p$ norm with $p \ge 1$, and $\mathbf{I}_K$ for the $K$-dimensional identity matrix. For a matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$, we use $\|\mathbf{A}\|_p$ to denote its induced $\ell_p$ norm, defined as $\|\mathbf{A}\|_p \triangleq \sup_{\mathbf{x} \neq \mathbf{0}} \|\mathbf{A}\mathbf{x}\|_p / \|\mathbf{x}\|_p$.

Learning-to-rank
In a ranking problem, the data consists of a set of triples (query, document, relevance score). A feature vector is used to represent a query-document pair. The relevance score indicates the degree of relevance of this document to its corresponding query. Given a ranking data set $\{(\mathbf{X}^q, \boldsymbol{\theta}^q)\}_{q=1}^{T}$, $q \in [\![T]\!] = \{1, \ldots, T\}$ indexes a query, and $\mathbf{X}^q$ and $\boldsymbol{\theta}^q$ represent the list of retrieved documents and corresponding relevance scores, respectively. The $q$-th query contains $n_q$ documents and $\mathbf{X}^q \in \mathbb{R}^{n_q \times p}$ has rows $(\mathbf{x}^q_1, \ldots, \mathbf{x}^q_{n_q})$, each of which is a $p$-dimensional document feature vector. The vector $\boldsymbol{\theta}^q = (y^q_1, \ldots, y^q_{n_q}) \in \mathbb{R}^{n_q}_{+}$ contains the corresponding ground-truth relevance scores, where a higher $y^q_d \in \mathbb{R}$ indicates that the document with features $\mathbf{x}^q_d$ is more relevant. In the learning-to-rank framework, denoting by $\mathbf{x}$ and $\theta$ the random variables that represent the document feature vector and relevance score, respectively, the goal is to learn a scoring function $f$ that best predicts the relevance score:
$$\min_{f} \; \mathbb{E}^{\mathbb{P}^*}\big[\ell(f(\mathbf{x}), \theta)\big], \tag{1}$$
where $\ell : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ is a loss function, $f : \mathbb{R}^p \to \mathbb{R}$ predicts the relevance score of each document, and $\mathbb{P}^*$ is the underlying true probability distribution of $(\mathbf{x}, \theta)$. Given that $\mathbb{P}^*$ is unknown, most existing LTR algorithms solve (1) by replacing the expected loss with its empirical substitute:
$$\min_{f} \; \frac{1}{N} \sum_{q=1}^{T} \sum_{d=1}^{n_q} \ell\big(f(\mathbf{x}^q_d), y^q_d\big), \tag{2}$$
where $N = \sum_{q=1}^{T} n_q$ is the total number of training documents. For a test query $\mathbf{X}^t \in \mathbb{R}^{n_t \times p}$ consisting of $n_t$ documents, the final predicted ranking list $\hat{\pi}$ is simply obtained by ranking the rows in $\mathbf{X}^t$ based on their inferred ranking scores $(f(\mathbf{x}^t_1), \ldots, f(\mathbf{x}^t_{n_t}))$. Eq (2) is restrictive in the sense that: (i) it does not take into account the inter-dependency of scores between documents, and (ii) the empirical estimate is very sensitive to data perturbations.

Distributionally Robust Optimization
Distributionally Robust Optimization (DRO) hedges against a set of probability distributions instead of just the empirical distribution. DRO minimizes a worst-case loss over a probabilistic ambiguity set:
$$\inf_{f} \; \sup_{\mathbb{Q} \in \Omega} \; \mathbb{E}^{\mathbb{Q}}\big[\ell(f(\mathbf{x}), \theta)\big],$$
where the ambiguity set $\Omega$ can be defined through moment constraints [20], or as a ball of distributions using some probabilistic distance function such as the Wasserstein distance [21,22]. The Wasserstein DRO model has been extensively studied in the machine learning community; see, for example, [23,24] for robustified regression models, [19] for adversarial training in neural networks, and [25] for distributionally robust logistic regression. The works [18,26,27] provide a comprehensive analysis of the Wasserstein-based distributionally robust statistical learning framework.

Problem formulation
Next, we introduce our DRO formulation of the LTR problem. Different from the existing works where a univariate relevance score $y^q_d \in \mathbb{R}$ is used for each document $\mathbf{x}^q_d \in \mathbb{R}^p$, we define a Ground Truth Deviation vector $\boldsymbol{\theta}^q_d \in \mathbb{R}^K$ to characterize different levels of importance for the document $\mathbf{x}^q_d$ in the $q$-th query. Here, $K$ is a constant to be defined later (cf. end of the next section). We also derive an equivalent reformulation of the DRO problem.

Ground Truth Deviation
As a popular evaluation criterion in information retrieval, Normalized Discounted Cumulative Gain (NDCG) can deal with cases that have more than two degrees of relevancy for documents [28]. Let $D(s) = 1/\log(1 + s)$ be a discount function, $G(s) = s$ a monotonically increasing gain function, and $Z_n = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}$ a set of documents ordered according to their ground-truth rank, with $\mathbf{x}_i$ and $y_i$ being a document feature vector and a relevance score, respectively. Assume $\hat{Z}_n$ is a (predicted) ranked list for $Z_n$; then the Discounted Cumulative Gain (DCG) of $\hat{Z}_n$ is defined as $F(\hat{Z}_n) = \sum_{r=1}^{n} G(y_{\pi_r}) D(r)$, where $\pi_r$ is the index of the document ranked at position $r$ of $\hat{Z}_n$. The reason for introducing the discount function is that the user cares less about documents ranked lower [29]. NDCG normalizes DCG by the Ideal DCG (IDCG), $F_I(\hat{Z}_n)$, which is the DCG score of the ideal ranking result [30], and can be computed by $F_N(\hat{Z}_n) = F(\hat{Z}_n)/F_I(\hat{Z}_n) \in [0, 1]$. Considering the $q$-th query $(\mathbf{X}^q, \mathbf{y}^q)$ that contains $n_q$ documents, we define a Ground Truth Deviation (GTD) vector $\boldsymbol{\theta}^q_d$ for document $d$ that combines the three components described next using the Hadamard product $\odot$ (a.k.a. the element-wise product).
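The DCG and NDCG definitions above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the natural logarithm is assumed for $D(s)$ (NDCG is a ratio, so the base does not affect it), and an all-zero relevance list is assigned NDCG 0 by convention.

```python
import numpy as np

def discount(rank):
    """D(s) = 1 / log(1 + s) for a 1-based rank s (natural log assumed)."""
    return 1.0 / np.log(1.0 + rank)

def gain(score):
    """G(s) = s, the identity gain stated in the text."""
    return np.asarray(score, dtype=float)

def dcg(relevance_in_ranked_order):
    """DCG of a ranked list: sum over positions r of G(y_{pi_r}) * D(r)."""
    y = gain(relevance_in_ranked_order)
    ranks = np.arange(1, len(y) + 1)
    return float(np.sum(y * discount(ranks)))

def ndcg(relevance_in_ranked_order):
    """NDCG = DCG / IDCG, where IDCG is the DCG of the ideal ordering."""
    ideal = np.sort(np.asarray(relevance_in_ranked_order, dtype=float))[::-1]
    idcg = dcg(ideal)
    # Convention (assumption): a list with no relevant document scores 0.
    return dcg(relevance_in_ranked_order) / idcg if idcg > 0 else 0.0

# Relevance labels listed in the order in which the documents were ranked.
ndcg_ideal = ndcg([2, 2, 1, 0])    # already in ideal order
ndcg_swapped = ndcg([0, 2, 1, 2])  # first and last documents exchanged
```

Here `ndcg_ideal` equals 1 because the list is already sorted by relevance, while `ndcg_swapped` is strictly smaller because a highly relevant document was moved to the bottom.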
NDCG deviation score ($\boldsymbol{\xi}_F$). To compute this vector, first, the elements of $\mathbf{y}^q = (y^q_1, \ldots, y^q_{n_q})$ are sorted in descending order of their ground-truth individual relevance scores, and the document feature vectors $\mathbf{X}^q = (\mathbf{x}^q_1, \ldots, \mathbf{x}^q_{n_q})$ are also sorted correspondingly. We denote them by $\bar{\mathbf{y}}^q$ and $\bar{\mathbf{X}}^q$, respectively. The NDCG score for $\bar{\mathbf{X}}^q$ is equal to 1. If we switch two documents in $\bar{\mathbf{X}}^q$, the NDCG will decrease or in some cases may stay the same (i.e., if their relevance scores are equal). For document $d$ in query $q$, we define the NDCG deviation score vector as $\boldsymbol{\xi}_F = (\lambda_{d1}, \ldots, \lambda_{dn_q})$, where $\lambda_{di}$ is the NDCG score of $\bar{\mathbf{X}}^q$ when we switch the position of document $d$ with the document that is in the $i$-th position of $\bar{\mathbf{X}}^q$. Since the swap changes only two terms of the DCG sum, it can be formulated as follows:
$$\lambda_{di} = 1 - \frac{\big(G(y^q_d) - G(y^q_{\pi_i})\big)\big(D(\pi^{-1}_d) - D(i)\big)}{F_I}.$$
Here, $\pi^{-1}_d$ is the position of the document $d$ in $\bar{\mathbf{X}}^q$, $\pi_i$ is the index of the document ranked at the $i$-th position of $\bar{\mathbf{X}}^q$, and $F_I$ is the IDCG. The details about the derivation can be found in the S1 Appendix. We can perceive the $i$-th element of the GTD vector as a score that indicates the degree of congruence between a document and the $i$-th rank.
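As an illustration of the swap-based definition, the following sketch computes the NDCG deviation scores for one document by literally exchanging positions in the ideally sorted list and re-evaluating NDCG. The function names and the toy relevance list are hypothetical; $G(s) = s$ and $D(s) = 1/\log(1+s)$ are taken from the text, and the IDCG is assumed to be positive. The closed-form check is this sketch's own algebra from the DCG definition (a swap changes only two terms of the sum), not a transcription of the derivation in the S1 Appendix.

```python
import numpy as np

def dcg(y):
    """DCG with gain G(s) = s and discount D(r) = 1 / log(1 + r)."""
    y = np.asarray(y, dtype=float)
    ranks = np.arange(1, len(y) + 1)
    return float(np.sum(y / np.log(1.0 + ranks)))

def ndcg_deviation_scores(y_sorted, pos_d):
    """Return (lambda_{d,1}, ..., lambda_{d,n}) for the document at 0-based
    position pos_d of the ideally sorted relevance list y_sorted."""
    y_sorted = np.asarray(y_sorted, dtype=float)
    idcg = dcg(y_sorted)  # assumed > 0
    scores = np.empty(len(y_sorted))
    for i in range(len(y_sorted)):
        swapped = y_sorted.copy()
        swapped[[pos_d, i]] = swapped[[i, pos_d]]  # exchange the two documents
        scores[i] = dcg(swapped) / idcg
    return scores

def ndcg_deviation_closed_form(y_sorted, pos_d):
    """Same scores via 1 - (G(y_d) - G(y_i)) * (D(pos_d) - D(i)) / IDCG."""
    y = np.asarray(y_sorted, dtype=float)
    disc = 1.0 / np.log(1.0 + np.arange(1, len(y) + 1))
    return 1.0 - (y[pos_d] - y) * (disc[pos_d] - disc) / dcg(y)

y_bar = [2, 2, 1, 0]  # toy relevance scores, sorted in descending order
xi_F = ndcg_deviation_scores(y_bar, pos_d=0)
xi_F_closed = ndcg_deviation_closed_form(y_bar, pos_d=0)
```

For this toy list, the first two entries of `xi_F` equal 1 (the document stays in place or is swapped with an equally relevant one), and the remaining entries fall below 1 as the document is moved further down.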
Position deviation score ($\boldsymbol{\xi}_D$). This vector is defined to further push the relevant documents to the top of the ranking list and penalize documents based on their position in the ranking list. The position deviation score works in conjunction with $\boldsymbol{\xi}_F$. We define it as $\boldsymbol{\xi}_D = (\rho_{d1}, \ldots, \rho_{dn_q})$, where $\rho_{di}$ is an asymmetric bell-shaped function of the deviation between position $i$ and the optimal position of document $d$, parameterized by $\alpha$ and $\beta$. As can be seen in Fig 1, $\alpha$ specifies the GTD's maximum score and $\beta$ regulates the magnitude of the penalty for a position deviation. Here, we use the red dashed curve for positive deviations (i.e., when a document is ranked higher than its optimal position) and the black curve for negative deviations. This induces our model to tolerate a positive deviation more than a negative one. Consequently, the model pushes the relevant documents to the top of the ranking list.

PLOS ONE
Document importance score ($\xi_I$). It is defined to place greater emphasis on highly relevant documents and depends on $\hat{y}$, the maximum possible value for relevance scores. Fig 1 presents $\xi_I$ for different values of $\hat{y}$.
Ultimately, instead of a relevance score for each document, we have a GTD vector. The GTD vector characterizes different levels of importance for a document in a query where the first element is the first level of importance, the second element is the second level of importance, and so on. Since each query may have a different number of documents, we just consider the first $K$ elements of $\boldsymbol{\xi}_F$ and $\boldsymbol{\xi}_D$ in our model, corresponding to $K$ levels of importance. In this way, all GTD vectors are of the same length. We prefer to use a low value for $K$ since it forces the model to focus on the most relevant documents. In case a large $K$ needs to be used and $K > n_q$, we can simply repeat the last element of $\boldsymbol{\xi}_F$ and $\boldsymbol{\xi}_D$ to pad these vectors to length $K$.
In a nutshell, the NDCG deviation score ($\boldsymbol{\xi}_F$) captures the relative position of a document in a query. On the other hand, the position deviation score ($\boldsymbol{\xi}_D$) and the document importance score ($\xi_I$) work in conjunction to push the relevant documents to the top of the list. We used an asymmetric bell-shaped function for the position deviation score to give a maximum score to correctly ranked documents. By using a "steeper left fall," we give a lower score to a negative position deviation (i.e., when a document is ranked lower than it should be) compared to a positive one. Moreover, $\alpha$ and $\beta$ enable us to control the maximum score and the magnitude of the penalty for a position deviation, respectively. In the S1 Appendix, we present an ablation study to gauge their effect on performance. We also provide an example of GTD vector calculation.

Distributionally Robust Multi-output Regression
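Consider a setting where there are $K$ levels of importance with features and importance scores distributed according to $\mathbf{x} \in \mathbb{R}^p$ and $\boldsymbol{\theta} \in \mathbb{R}^K$, respectively. We restrict our attention to linear function classes by assuming $f(\mathbf{x}) = \mathbf{B}'\mathbf{x}$, where $\mathbf{B} \in \mathbb{R}^{p \times K}$. The matrix $\mathbf{B}$ characterizes the dependency structure of the different levels of importance. Nonlinearity can be introduced by applying a transformation (e.g., kernel function) on the feature $\mathbf{x}$. The Distributionally Robust Multi-output Regression Ranking (DRMRR) formulation minimizes the worst-case expected loss as follows:
$$\inf_{\mathbf{B}} \; \sup_{\mathbb{Q} \in \Omega} \; \mathbb{E}^{\mathbb{Q}}\big[\ell(\boldsymbol{\theta} - \mathbf{B}'\mathbf{x})\big], \tag{4}$$
where $\ell : \mathbb{R}^K \to \mathbb{R}$ is a Lipschitz continuous loss function on the metric spaces $(\mathcal{D}, \|\cdot\|_r)$ and $(\mathcal{C}, |\cdot|)$, where $\mathcal{D}$, $\mathcal{C}$ are the domain and co-domain of $\ell(\cdot)$, respectively. In (4), $\mathbb{Q} \in \Omega \triangleq \{\mathbb{Q} \in \mathcal{P}(\mathcal{S}) : W_1(\mathbb{Q}, \hat{\mathbb{P}}_N) \le \epsilon\}$ is the probability distribution of $(\mathbf{x}, \boldsymbol{\theta})$, where $\mathcal{P}(\mathcal{S})$ is the space of all probability distributions supported on $\mathcal{S}$ and $\mathcal{S}$ is the uncertainty set of $(\mathbf{x}, \boldsymbol{\theta})$, $\epsilon$ is a positive constant (i.e., the Wasserstein ball radius), $\hat{\mathbb{P}}_N$ is the empirical distribution that assigns an equal probability to all $N$ training samples, with $N = \sum_{j=1}^{T} n_j$, where $T$ is the number of queries, and $W_1(\mathbb{Q}, \hat{\mathbb{P}}_N)$ is the order-1 Wasserstein distance between $\mathbb{Q}$ and $\hat{\mathbb{P}}_N$ defined as
$$W_1(\mathbb{Q}, \hat{\mathbb{P}}_N) \triangleq \min_{\Pi \in \mathcal{P}(\mathcal{S} \times \mathcal{S})} \int_{\mathcal{S} \times \mathcal{S}} \delta(\mathbf{z}_1 - \mathbf{z}_2) \, \Pi(d\mathbf{z}_1, d\mathbf{z}_2).$$
In the distance, $\delta(\mathbf{z}_1 - \mathbf{z}_2) \triangleq \|\mathbf{z}_1 - \mathbf{z}_2\|_r$ with $\mathbf{z}_i = (\mathbf{x}_i, \boldsymbol{\theta}_i)$, $i = 1, 2$, drawn from $\mathbb{Q}$ and $\hat{\mathbb{P}}_N$, respectively, and $\Pi$ specifies the joint distribution of $\mathbf{z}_1$ and $\mathbf{z}_2$ with marginals $\mathbb{Q}$ and $\hat{\mathbb{P}}_N$. Note that the same norm is used to define the Wasserstein metric and the domain space of $\ell(\cdot)$. In the following theorem we propose an equivalent reformulation of (4) by using duality for the inner maximization problem.
Theorem 0.1. Suppose our dataset consists of $T$ queries $\{(\mathbf{X}^q, \boldsymbol{\Theta}^q)\}_{q=1}^{T}$ and each query $q$ contains $n_q$ documents, $q \in [\![T]\!]$, where $\mathbf{X}^q \in \mathbb{R}^{n_q \times p}$ is the document feature matrix with rows $\mathbf{x}^q_d \in \mathbb{R}^p$, $d \in [\![n_q]\!]$, and $\boldsymbol{\Theta}^q \in \mathbb{R}^{n_q \times K}$ is the GTD matrix with rows $\boldsymbol{\theta}^q_d \in \mathbb{R}^K$. Define a loss function $\ell(\cdot) \triangleq \|\cdot\|_r$. If the Wasserstein metric is induced by $\|\cdot\|_r$, the DRMRR problem (4) can be equivalently reformulated as:
$$\inf_{\mathbf{B}} \; \frac{1}{N} \sum_{q=1}^{T} \sum_{d=1}^{n_q} \|\boldsymbol{\theta}^q_d - \mathbf{B}'\mathbf{x}^q_d\|_r + \epsilon \|\tilde{\mathbf{B}}'\|_s, \tag{5}$$
where $r, s \ge 1$; $1/r + 1/s = 1$; $\tilde{\mathbf{B}} = (-\mathbf{B}', \mathbf{I}_K)$.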
The proof can be found in the S1 Appendix. Thm. 0.1 establishes a connection between distributional robustness and regularization, which has also been studied by, e.g., [22,25,26]. However, most of the existing studies focused on a univariate output. By contrast, our work adapts the DRO framework to a multi-output setting, which is more suitable for the ranking problem. Recently, [18] studied a multi-output regression problem under the Wasserstein DRO framework. However, our results in Theorem 0.1 present a tighter reformulation than theirs (Eq. (6.2) in [18]). In the case where the Wasserstein metric is induced by the $\ell_2$ norm ($r = 2$), Eq (5) yields a regularizer which is the spectral norm (largest singular value) of $\tilde{\mathbf{B}}'$, while [18] derived a regularizer in the Frobenius norm, which is looser.
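To make the $r = 2$ case concrete, the sketch below evaluates a regularized objective of the kind described above and compares the spectral norm of $\tilde{\mathbf{B}} = (-\mathbf{B}', \mathbf{I}_K)$ with its Frobenius norm. Several things here are assumptions of this sketch rather than statements from the paper: the objective is taken to be the average $\ell_2$ residual norm plus $\epsilon$ times the spectral norm, the function name is hypothetical, and the data are random placeholders. The inequality "spectral norm $\le$ Frobenius norm" holds for any matrix, which is the sense in which the spectral-norm regularizer is the less conservative of the two.

```python
import numpy as np

def drmrr_objective_l2(B, X, Theta, eps):
    """Average l2 residual norm plus eps * spectral norm of (-B', I_K).

    Assumed form for r = 2; X is (N, p), Theta is (N, K), B is (p, K).
    """
    K = B.shape[1]
    residuals = Theta - X @ B                 # row d equals theta_d - B' x_d
    empirical_loss = float(np.mean(np.linalg.norm(residuals, axis=1)))
    B_tilde = np.hstack([-B.T, np.eye(K)])    # shape (K, p + K)
    spectral = float(np.linalg.norm(B_tilde, 2))  # largest singular value
    return empirical_loss + eps * spectral

rng = np.random.default_rng(0)
p, K, N, eps = 6, 3, 40, 0.1                  # placeholder sizes and radius
X = rng.normal(size=(N, p))
Theta = rng.normal(size=(N, K))
B = rng.normal(size=(p, K))

B_tilde = np.hstack([-B.T, np.eye(K)])
spectral_norm = float(np.linalg.norm(B_tilde, 2))
frobenius_norm = float(np.linalg.norm(B_tilde, "fro"))
objective = drmrr_objective_l2(B, X, Theta, eps)
```

A matrix and its transpose share the same singular values, so computing the spectral norm on $\tilde{\mathbf{B}}$ or on $\tilde{\mathbf{B}}'$ gives the same number. Because the objective is convex in $\mathbf{B}$, a generic convex solver could in principle be used to minimize it; the solver choice is outside the scope of this sketch.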

Score calculation
Suppose we are given a test query $\mathbf{X}^t = (\mathbf{x}^t_1, \ldots, \mathbf{x}^t_{n_t})' \in \mathbb{R}^{n_t \times p}$; we can estimate the GTD matrix as $\hat{\boldsymbol{\Theta}}^t = (\mathbf{B}'\mathbf{x}^t_1, \ldots, \mathbf{B}'\mathbf{x}^t_{n_t})' \in \mathbb{R}^{n_t \times K}$. In the matrix $\hat{\boldsymbol{\Theta}}^t$, columns correspond to different ranks and rows correspond to different documents. Algorithm 1 demonstrates the procedure of ranking using the output of the DRMRR algorithm, where $R_K(j)$ is the remainder of dividing $j$ by $K$. In the S1 Appendix, we present an intuitive toy example to illustrate this algorithm better.

Experiment setup
Data sets. We conducted experiments on two publicly available benchmark datasets: OHSUMED (https://www.microsoft.com/en-us/research/project/letor-learning-rank-information-retrieval/) and Drug Response Prediction (DRP) (https://modac.cancer.gov/assetDetails?dme_data_id=NCI-DME-MS01-8088592). As a subset of the MEDLINE database (a database on medical publications), the OHSUMED corpus [31] consists of about 0.3 million records from 270 medical journals from 1987 to 1991. A query set with 106 queries on the OHSUMED corpus has been extensively used in previous works, in which each query-document pair is represented by 45 features [2]. There are in total 16,140 query-document pairs with relevance judgments. LETOR [2] defined three ratings 0, 1, 2, corresponding to "irrelevant," "partially relevant," and "definitely relevant," respectively. In addition to OHSUMED, we trained and evaluated our method using the cell line data and drug sensitivity data from the Cancer Cell Line Encyclopedia (CCLE) [32] and the Cancer Therapeutics Response Portal (CTRP v2) [33]. A total of 332 cell lines (i.e., queries) and 50 drug responses were used. The "Act Area" (the area above the fitted dose-response curve) was used to quantify drug sensitivity where a lower response value indicates higher drug sensitivity. After several pre-processing steps, cell lines are represented by 251 numeric features (i.e., genes) and drug sensitivities are labeled with graded relevance from 0 to 2 (i.e., "insensitive," "sensitive," and "highly sensitive," respectively) with larger labels indicating a higher sensitivity. Further details of the data pre-processing steps can be found in the S1 Appendix. Moreover, all code written in support of this publication is publicly available on a GitHub repository (https://github.com/noc-lab/DRMRR-Distributionally-robust-learning-to-rank-under-the-Wasserstein-metric). Please note that we targeted biomedical applications with limited data.
Since the number of drug-cell line pairs is much less than the number of features, most approaches "overfit." Similarly, OHSUMED challenges ranking models due to its small sample size.
Evaluation metrics. We evaluated model performance using two metrics: NDCG@k and AP@k. NDCG@k is the top-k version of NDCG, where the discount function is $D(s) = 0$ for $s > k$. Precision at position k (P@k) is the fraction of relevant documents in the top-k. Suppose we have binary relevance for the documents in the $q$-th query; we define P@k as $\text{P@}k = \frac{1}{k} \sum_{j=1}^{k} \mathbb{1}(y_{\pi_j} = 1)$, where $\mathbb{1}(\cdot)$ is the indicator function. We define Average Precision at position k as $\text{AP@}k = \frac{1}{m} \sum_{j=1}^{k} \text{P@}j \cdot \mathbb{1}(y_{\pi_j} = 1)$, where $m$ is the total number of relevant documents in the top-k of the ranking list. AP is a highly localized performance measure and captures the quality of rankings for applications where only the first few results matter. The main difference between AP and NDCG is that NDCG differentiates between "partially relevant" and "definitely relevant" documents while AP treats them equally. Given a set of testing queries and a performance metric, we are interested in the mean metric, which is simply the mean of the performance metric for all queries. From now on, we use NDCG@k and AP@k to denote mean NDCG@k and mean AP@k, respectively.
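The three per-query metrics can be sketched as follows. This is an illustrative implementation under stated assumptions, not the evaluation code used in the experiments: the function names are hypothetical, labels for P@k and AP@k are assumed to be already binarized (1 for relevant), AP@k is set to 0 when no relevant document appears in the top-k, and NDCG@k uses $G(s) = s$ with the natural logarithm in the discount.

```python
import numpy as np

def precision_at_k(y_ranked, k):
    """P@k = (1/k) * sum_{j<=k} 1(y_{pi_j} = 1), for binary labels."""
    top = np.asarray(y_ranked)[:k]
    return float(np.sum(top == 1)) / k

def average_precision_at_k(y_ranked, k):
    """AP@k = (1/m) * sum_{j<=k} P@j * 1(y_{pi_j} = 1), m = relevant in top-k."""
    hits = np.asarray(y_ranked)[:k] == 1
    m = int(hits.sum())
    if m == 0:
        return 0.0  # convention assumed here when the top-k has no relevant item
    total = sum(precision_at_k(y_ranked, j + 1) for j in range(len(hits)) if hits[j])
    return total / m

def ndcg_at_k(y_ranked, k):
    """NDCG with the discount set to zero beyond position k."""
    y = np.asarray(y_ranked, dtype=float)

    def dcg_at_k(values):
        top = values[:k]
        ranks = np.arange(1, len(top) + 1)
        return float(np.sum(top / np.log(1.0 + ranks)))

    idcg = dcg_at_k(np.sort(y)[::-1])
    return dcg_at_k(y) / idcg if idcg > 0 else 0.0

binary_labels = [1, 0, 1, 1, 0]   # relevance of documents in ranked order
p_at_3 = precision_at_k(binary_labels, 3)
ap_at_3 = average_precision_at_k(binary_labels, 3)
graded_labels = [1, 2, 0, 2]
ndcg_at_2 = ndcg_at_k(graded_labels, 2)
```

For `binary_labels`, two of the top three documents are relevant, so P@3 is 2/3 and AP@3 is (1 + 2/3)/2 = 5/6. The mean metrics reported in the experiments would be obtained by averaging such per-query values over all test queries.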
Thus, we rely on prior research [4,41,44] and do not include the weaker methods in our experiments. It is important to note that the author of XE-MART NDCG proposed this model as a robust alternative to LambdaMART-based models. We also compared DRMRR against the state-of-the-art transformer-based neural ranking model [45] with different loss functions. However, since the tree-based baselines performed far better than the neural model (especially on our main application, namely DRP), we defer the results for the neural model to the S1 Appendix.
Experimental settings and hyper-parameter optimization. In our experiments, we used the standard supervised LTR framework [9]. The authors of LETOR [2] partitioned the OHSUMED data set into five parts for five-fold cross-validation, where three parts were used for training, one part for validation (i.e., tuning the hyperparameters of the learning algorithms), and the remaining part for evaluating the performance of the learned model. Similarly, we partitioned the drug response data set into five folds and conducted five-fold cross-validation to train, validate, and evaluate the ranking algorithms. In all experiments, the average on the test set over the 5 folds was reported. Algorithm parameters were tuned on the validation sets. We optimized the algorithm parameters to maximize NDCG@5 and NDCG@10. The details of the parameter-tuning procedure and the optimal parameters for each algorithm can be found in the S1 Appendix.

Overall comparison
We compared the performance of DRMRR on the OHSUMED and DRP data sets with the baseline methods introduced in the previous sections. The results are in Table 1. The values inside the parentheses denote the Standard Deviation (SD) of the corresponding metrics. Bold numbers indicate the best performance among all methods for each metric. DRMRR consistently outperforms all baseline methods across all metrics. In our experiment on OHSUMED data, LambdaMART NDCG demonstrated a reasonably good overall performance and was the second-best method. However, XE-MART NDCG was the second-best method in our experiment on the DRP data. The difference between the best and the second-best methods for the DRP data set is greater than what we obtained for OHSUMED. Due to the limited number of samples available and the specific structure of the DRP data, the performance of the baseline methods diminished significantly. On the other hand, DRMRR was able to maintain its high performance. To sum up, the proposed method is able not only to push the most relevant documents (or sensitive drugs) to the top of the ranking list, but also to put them in the right order. Furthermore, as we discuss in the Supplement, our model is more efficient (low model complexity) and generalizes better (typically, the generalization error increases with model complexity).

Robustness comparison
In this section, we empirically study the behavior of DRMRR in the presence of noise. While our overall performance analysis suggested that DRMRR should be the most "well-behaved" of the four, that analysis was performed on the clean data. The robustness of a ranking model to noise is crucial in practice, especially in the healthcare domain. We put this hypothesis to the test through four types of experiments. We conducted all experiments on the OHSUMED data set since it is a popular and standard LTR data set. In all experiments, the values are the average of 5 folds. Gaussian noise attack. We added Gaussian noise to the test documents to deliberately corrupt them, thereby degrading their predictability. Gaussian noise was added to a randomly selected 75% of the test queries. Experiments were conducted using various means and a fixed standard deviation of 0.001. We used the perturbed test data to evaluate the trained models (i.e., all algorithms were tested on the same perturbed test data). Fig 2 demonstrates the performance of the algorithms on the perturbed test data. Two observations are in order: (i) DRMRR outperformed the baseline models at different levels of noise; and (ii) DRMRR demonstrated a relatively stable performance.
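The Gaussian noise attack can be sketched as below. This is a minimal illustration under assumptions that the text does not spell out: the function name is hypothetical, noise is drawn independently for every feature of every document in a selected query, and the number of attacked queries is the rounded 75% of the total. The toy feature matrices are placeholders, not OHSUMED data.

```python
import numpy as np

def gaussian_noise_attack(test_queries, mean, std=0.001, fraction=0.75, seed=0):
    """Add N(mean, std^2) noise to the features of a random subset of queries.

    test_queries: list of (n_t, p) feature matrices, one per query.
    Returns the perturbed list and the sorted indices of attacked queries.
    """
    rng = np.random.default_rng(seed)
    n_queries = len(test_queries)
    n_attacked = int(round(fraction * n_queries))
    attacked = set(rng.choice(n_queries, size=n_attacked, replace=False).tolist())
    perturbed = []
    for q, X in enumerate(test_queries):
        X = np.array(X, dtype=float)  # copy so the clean data stay untouched
        if q in attacked:
            X = X + rng.normal(loc=mean, scale=std, size=X.shape)
        perturbed.append(X)
    return perturbed, sorted(attacked)

# Toy test set: 8 queries, each with 3 documents and 4 features.
clean_queries = [np.zeros((3, 4)) for _ in range(8)]
noisy_queries, attacked_ids = gaussian_noise_attack(clean_queries, mean=0.1)
```

Using a fixed `seed` keeps the perturbed test set identical across algorithms, which matches the requirement that all models be evaluated on the same perturbed data. Sweeping `mean` over a grid would reproduce the "various means" axis of the experiment.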
Universal adversarial perturbation attack. We built an adversarial model to introduce perturbations that break the neighborhood relationships by altering the input slightly. To that end, a pointwise linear regression ranking model was trained as the adversarial model on the clean training set. Then, 75% of the test queries were perturbed using the coefficients of the adversarial model and the Fast Gradient Sign Method (FGSM):
$$\tilde{\mathbf{x}}^q_d = \mathbf{x}^q_d + \sigma \, \mathrm{sign}\big(\nabla_{\mathbf{x}} J(\mathbf{x}^q_d, y^q_d)\big),$$
where $\tilde{\mathbf{x}}^q_d$ is the perturbed feature vector, $\sigma$ controls the magnitude of the perturbations, and $J$ is the cost function of the adversarial model [46].
All algorithms that we trained in the "Overall comparison" section were evaluated on the same perturbed test data. In this case, the adversary had no knowledge of the ranking models; however, it was trained on the same training data. Fig 3 shows the performance of the algorithms on the perturbed test data. As we increase the level of perturbations (i.e., σ), we can see that DRMRR is less sensitive to adversarial perturbations in comparison with the competing methods. It demonstrated a stable performance across all metrics. Among the baselines, XE-MART NDCG, which performed well in terms of NDCG@5, demonstrated poor performance in terms of AP@5.
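A sketch of the universal attack with a linear adversary is given below. It is illustrative only and rests on assumptions not fixed by the text: the adversary is an ordinary least-squares model without intercept, its cost is the squared error $J = \tfrac{1}{2}(\mathbf{w}'\mathbf{x} - y)^2$, so that $\nabla_{\mathbf{x}} J = (\mathbf{w}'\mathbf{x} - y)\,\mathbf{w}$, and all names and data are hypothetical placeholders.

```python
import numpy as np

def fit_linear_adversary(X, y):
    """Least-squares coefficients of a pointwise linear ranking model."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def squared_error_cost(X, y, w):
    """Per-document cost J = 0.5 * (w'x - y)^2 of the adversarial model."""
    return 0.5 * (X @ w - y) ** 2

def fgsm_perturb(X, y, w, sigma):
    """FGSM step: x_tilde = x + sigma * sign(grad_x J), grad_x J = (w'x - y) w."""
    residual = X @ w - y                        # shape (n,)
    grad = residual[:, None] * w[None, :]       # shape (n, p)
    return X + sigma * np.sign(grad)

rng = np.random.default_rng(0)
w_true = rng.normal(size=5)
X_train = rng.normal(size=(50, 5))
y_train = X_train @ w_true + 0.1 * rng.normal(size=50)
X_test = rng.normal(size=(10, 5))
y_test = X_test @ w_true + 0.1 * rng.normal(size=10)

sigma = 0.05
w_adv = fit_linear_adversary(X_train, y_train)
X_adv = fgsm_perturb(X_test, y_test, w_adv, sigma)
cost_clean = squared_error_cost(X_test, y_test, w_adv)
cost_adv = squared_error_cost(X_adv, y_test, w_adv)
```

Each feature moves by at most `sigma`, and for a linear adversary the step increases the absolute residual by `sigma` times the $\ell_1$ norm of the coefficients, so the adversary's own cost cannot decrease. The perturbed features would then be passed unchanged to every ranking model under evaluation.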
Black-box adversarial attack. The black-box adversarial attack restricts the attacker's knowledge only to the deployed model [47]. The setting of black-box attacks is closer to the real-world scenario; therefore, this is the most practical experiment to measure the robustness of our algorithm. Please refer to [47] for more information on black-box adversarial attacks. Since the adversary has no access to the model's weights and parameters, the adversary can choose to train a parallel model, called a substitute model, to imitate the original model. Here, we use a four-layer fully connected network as our substitute model (see the S1 Appendix for more details). To construct the substitute models, the training data were independently fed to each model and the output was observed. Then, for each algorithm, a Neural Network (NN) ranking model was trained as an adversarial substitute model using the training feature vectors and the observed output of that specific algorithm. Subsequently, 75% of the test queries were perturbed using FGSM and the parameters of the substitute model corresponding to each algorithm. In total, we trained four substitute models, one for each ranking algorithm. We used the algorithm-specific perturbed test data to evaluate the best trained models (i.e., each model has a different perturbed test set). Fig 4 demonstrates the performance of the algorithms on the perturbed test data. The values are the average of 5 folds. We can see that both figures show the same trend: increasing the level of perturbations (i.e., σ) leads to significant differences between DRMRR and the baseline methods. The competing methods were greatly affected by this type of noise, whereas their performance was modest in the simpler experiments, namely universal adversarial attack and Gaussian attack. We conclude that DRMRR is robust to adversarial perturbations, an important property that leads to good generalization ability.

Label attack. In practice, the vagueness of query intent, insufficient domain knowledge, and ambiguous definition of relevance levels make it hard for human judges to assign proper relevance labels to some documents. Practically speaking, the probability of judgment errors in various relevance degrees is not equal. Even if human annotators misjudge a document, they are more likely to assign a label close to its ground-truth label. Inspired by [48], we define the non-uniform error probabilities in Table 2, where each entry corresponds to the probability that a document with ground-truth label $y^q_i$ is labeled as $y^q_j$. We randomly changed the labels of the training data using the probabilities in Table 2. Then, each model was trained on the noisy training data. Clean test data were used to evaluate each model. We conducted two sets of experiments, namely low label noise (i.e., e = 0.85) and high label noise (i.e., e = 0.7). Table 3 reports the performance of the algorithms on the clean test data. The values in this table are the average of 5 folds. For the low noise scenario, the differences between the average AP@5 and NDCG@5 of the baseline models and DRMRR were 2.53% and 1.59%, respectively. Notably, the gaps were even larger for the high noise scenario (AP@5 = 3.08%, NDCG@5 = 1.68%). Since noise in human-labeled data is an inevitable issue, we can argue that the baseline models are susceptible and degrade more severely as more noise is added to the training set.
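The label-poisoning step can be sketched as sampling each noisy label from a row-stochastic transition matrix indexed by the ground-truth label. The function name is hypothetical, and the example matrix is a placeholder invented for illustration: it keeps the true label with probability e and gives adjacent labels more mass than distant ones, but it does not reproduce the entries of Table 2.

```python
import numpy as np

def poison_labels(labels, transition, seed=0):
    """Resample each label y from the distribution in row y of `transition`.

    transition[i, j] is the probability that true label i is recorded as j.
    """
    labels = np.asarray(labels, dtype=int)
    transition = np.asarray(transition, dtype=float)
    if not np.allclose(transition.sum(axis=1), 1.0):
        raise ValueError("each row of the transition matrix must sum to 1")
    rng = np.random.default_rng(seed)
    n_levels = transition.shape[0]
    return np.array([rng.choice(n_levels, p=transition[y]) for y in labels])

e = 0.85  # probability of keeping the ground-truth label (low-noise setting)
# Hypothetical placeholder matrix for labels {0, 1, 2}; NOT the values of Table 2.
example_transition = np.array([
    [e,           2 * (1 - e) / 3, (1 - e) / 3],
    [(1 - e) / 2, e,               (1 - e) / 2],
    [(1 - e) / 3, 2 * (1 - e) / 3, e],
])

clean_labels = np.array([0, 1, 2, 2, 1, 0, 0, 2, 1, 1])
noisy_labels = poison_labels(clean_labels, example_transition)
unchanged_labels = poison_labels(clean_labels, np.eye(3))
```

Only the training labels would be passed through `poison_labels`; the test labels stay clean. Setting `e = 0.7` in the placeholder matrix would correspond to the high-noise setting.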

Discussion and conclusion
This paper went beyond conventional listwise learning-to-rank approaches and introduced a distributionally robust learning-to-rank framework with multiple outputs, referred to as DRMRR. Unlike existing methods, the scoring function in DRMRR was designed as a multivariate mapping from a feature vector to a vector of deviation scores (a.k.a. GTD vector). The GTD vector captures local context information and cross-document interactions. Moreover, we formulated DRMRR as a min-max problem where one minimizes a worst-case expected loss over a probabilistic ambiguity set. The ambiguity set was defined as a ball of distributions using the Wasserstein metric. Notably, we presented a compact and computationally solvable equivalent reformulation of the min-max formulation of DRMRR. We compared DRMRR with the baseline models in terms of: (a) the overall performance on two real-world applications and (b) the robustness to various types and degrees of noise. In medical document retrieval, DRMRR outperformed state-of-the-art LTR models and established its capability in differentiating relevant documents from irrelevant ones. In drug response prediction, our results indicated that DRMRR leads to substantially improved performance when compared to the competing methods across all performance metrics. Thus, DRMRR can infer robust predictors of drug responses from patient genomic or proteomic profiles which can lead to selecting a highly effective personalized treatment. In our robustness evaluations, we conducted a comprehensive analysis to assess the resilience of DRMRR against various types of noise and perturbations. Experimental results demonstrated that DRMRR is effective against: (i) Gaussian noise; (ii) universal adversarial perturbations by a substitute model with no knowledge of the victim model; (iii) black-box adversarial perturbations by a substitute model with access only to the deployed victim model; and (iv) probabilistic perturbation of relevance labels. 
Interestingly, the performance of DRMRR was consistently better than that of the baseline methods at all noise levels. More importantly, DRMRR showed no significant change in its performance as the noise intensity increased. Two attributes of DRMRR helped to enhance its performance and robustness: (i) efficiently capturing the contextual information and interrelationships between documents/drugs via the GTD vector; and (ii) distributional robustness, obtained by hedging against a family of plausible distributions that includes the true distribution with high confidence. Even though DRMRR demonstrated promising performance, it has some limitations that can be addressed in future work. DRMRR solves a convex problem, which can be done very efficiently with first-order gradient methods. Its computational complexity is comparable to training the leaf nodes of tree models (or the last layer of a neural network model), where a simple regression model is being trained. However, listwise ranking models can be relatively complex compared to pointwise or pairwise approaches, and DRMRR is no exception. One possible direction is to reformulate the problem to speed up the solution of the DRO problem considered in this paper. As for the DRP application, an interesting future direction is to incorporate the toxicity of drugs in our predictions. Since biological dissimilarities among patients affect the side effects of medications, patients may experience different side effects. Hence, we can improve our predictions by considering the side effects and toxicity of drugs.